<a href="https://github.com/dd-consulting">
<img src="../reference/GZ_logo.png" width="60" align="right">
</a>
<h1>
One-Stop Analytics: Hypothesis Test
</h1>
Autism spectrum disorder (ASD) is a developmental disability that can cause significant social, communication and behavioral challenges. CDC is committed to continuing to provide essential data on ASD, search for factors that put children at risk for ASD and possible causes, and develop resources that help identify children with ASD as early as possible.
Doctors cited better awareness among parents and preschool teachers, leading to early referrals for diagnosis.
https://www.gov.sg/news/content/today-online-more-preschoolers-diagnosed-with-developmental-issues
<a href="">
</a>
https://www.cdc.gov/ncbddd/autism/data/index.html
<a href="">
</a>
<h3>
Hypothesis Test - Data
</h3>
Obtain current R working directory
getwd()
## [1] "/media/sf_vm_shared_folder/git/DDC-ASD/model_R"
Set new R working directory
# setwd("/media/sf_vm_shared_folder/git/DDC/DDC-ASD/model_R")
# setwd('~/Desktop/admin-desktop/vm_shared_folder/git/DDC-ASD/model_R')
getwd()
## [1] "/media/sf_vm_shared_folder/git/DDC-ASD/model_R"
Read in CSV data, storing as R dataframe
# Read back in above saved file:
ASD_State <- read.csv("../dataset/ADV_ASD_State_R.csv")
# Convert Year_Factor to ordered.factor
ASD_State$Year_Factor <- factor(ASD_State$Year_Factor, ordered = TRUE)
ASD_State$Prevalence_Risk2 = factor(ASD_State$Prevalence_Risk2, ordered=TRUE,
levels=c("Low", "High"))
ASD_State$Prevalence_Risk4 = factor(ASD_State$Prevalence_Risk4, ordered=TRUE,
levels=c("Low", "Medium", "High", "Very High"))
# library(repr)
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=4)
<h3>
Hypothesis Test - Mean - Z Test & t.test()
</h3>
Two sample mean test (t test):
Prepare samples
# Filter and create dataframe of different data sources, for easy data access
ASD_State_ADDM <- subset(ASD_State, Source == 'addm')
# Aggregate all US. States into three regions: East, West, Other.
ASD_State_ADDM$State_Region2 <- "Other"
ASD_State_ADDM$State_Region2[ASD_State_ADDM$State_Region %in%
c("D3 East North Central", "D6 East South Central")] <- "East"
ASD_State_ADDM$State_Region2[ASD_State_ADDM$State_Region %in%
c("D4 West North Central", "D7 West South Central")] <- "West"
boxplot(ASD_State_ADDM$Prevalence ~ ASD_State_ADDM$State_Region2)
# ----------------------------------
# Hypothesis Test - Mean - Z Test & t.test()
# ----------------------------------
# Create sample 1 of West states
sample_1 = ASD_State_ADDM$Prevalence[ASD_State_ADDM$State_Region2 == "West"]
# Create sample 2 of East states
sample_2 = ASD_State_ADDM$Prevalence[ASD_State_ADDM$State_Region2 == "East"]
# variance test : Equal variance or Unequal variance?
var.test(sample_1, sample_2)
##
## F test to compare two variances
##
## data: sample_1 and sample_2
## F = 1.4875, num df = 12, denom df = 12, p-value = 0.5019
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.4538834 4.8749547
## sample estimates:
## ratio of variances
## 1.487502
# t test : Equal variance t test
t.test(sample_1, sample_2, var.equal = TRUE)
##
## Two Sample t-test
##
## data: sample_1 and sample_2
## t = 2.7482, df = 24, p-value = 0.01119
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.107134 7.785174
## sample estimates:
## mean of x mean of y
## 12.323077 7.876923
# Visualise samples:
plot(density(sample_2), col="orange", xlab="Prevalence", main="PDF",
lwd=2, lty=1, xlim=c(0,30))
lines(density(sample_1), col="blue", lwd=2, lty=1)
https://www.statsdirect.co.uk/help/Default.htm#parametric_methods/unpaired_t.htm
https://www.bmj.com/about-bmj/resources-readers/publications/statistics-square-one/7-t-tests
#############################################################
# Deep Dive : Equal variance t test
#############################################################
# sample size
cal_n1 = length(sample_1)
cat("\ncal_n1:", cal_n1)
##
## cal_n1: 13
cal_n2 = length(sample_2)
cat("\ncal_n2:", cal_n2)
##
## cal_n2: 13
# degree of freedom
cal_df = cal_n1 + cal_n2 - 2
cat("\ncal_df:", cal_df)
##
## cal_df: 24
# pooled standard deviation
cal_s = sqrt(((cal_n1 - 1)*sd(sample_1)^2 + (cal_n2 - 1)*sd(sample_2)^2)
/(cal_n1 + cal_n2 -2))
cat("\ncal_s :", cal_s)
##
## cal_s : 4.124652
# combined standard error
cal_se = cal_s * sqrt(1/cal_n1 + 1/cal_n2)
cat("\ncal_se:", cal_se)
##
## cal_se: 1.617822
# t statistic using combined standard error
cal_t = (mean(sample_1)-mean(sample_2)) / cal_se
cat("\ncal_t :", cal_t)
##
## cal_t : 2.748235
# p-value
cal_p = (1-pt(q = cal_t, df = cal_df)) * 2
cat("\ncal_p :", cal_p)
##
## cal_p : 0.01119334
# 95% CI using "pooled standard deviation"
# sample_1 Upper CI
mn_upper_ci = mean(sample_1) + qt(p = 0.975, df = cal_df)*cal_s/sqrt(cal_n1+cal_n2)
cat("\nmn_upper_ci:", mn_upper_ci)
##
## mn_upper_ci: 13.99259
# sample_1 Lower CI
mn_lower_ci = mean(sample_1) - qt(p = 0.975, df = cal_df)*cal_s/sqrt(cal_n1+cal_n2)
cat("\nmn_lower_ci:", mn_lower_ci)
##
## mn_lower_ci: 10.65357
# sample_2 Upper CI
fn_upper_ci = mean(sample_2) + qt(p = 0.975, df = cal_df)*cal_s/sqrt(cal_n1+cal_n2)
cat("\nfn_upper_ci:", fn_upper_ci)
##
## fn_upper_ci: 9.546433
# sample_2 Lower CI
fn_lower_ci = mean(sample_2) - qt(p = 0.975, df = cal_df)*cal_s/sqrt(cal_n1+cal_n2)
cat("\nfn_lower_ci:", fn_lower_ci)
##
## fn_lower_ci: 6.207413
# Difference with 95% CI
cat("\nDifference's Upper CI:", mn_upper_ci - fn_lower_ci)
##
## Difference's Upper CI: 7.785174
cat("\nDifference's Lower CI:", mn_lower_ci - fn_upper_ci)
##
## Difference's Lower CI: 1.107134
# Difference with 95% CI using combined standard error
# Difference's Upper CI
dif_upper_ci = mean(sample_1)-mean(sample_2) + qt(p = 0.975, df = cal_df)*cal_se
cat("\nDifference's Upper CI:", dif_upper_ci)
##
## Difference's Upper CI: 7.785174
# Difference's Lower CI
dif_lower_ci = mean(sample_1)-mean(sample_2) - qt(p = 0.975, df = cal_df)*cal_se
cat("\nDifference's Lower CI:", dif_lower_ci)
##
## Difference's Lower CI: 1.107134
# Visualise CIs: sample_1 & sample_2
plot(density(sample_2), col="orange", xlab="Prevalence", lwd=2, ylim=c(0,0.3),
main="East vs. West ASD Prevalence PDF")
lines(density(sample_1), col="blue", lwd=2)
# Overlay sample_1 CI
abline(v= mean(sample_1), col="blue", lwd=2, lty=1)
abline(v=mn_upper_ci, col="blue", lwd=2, lty=3)
abline(v=mn_lower_ci, col="blue", lwd=2, lty=3)
# Overlay sample_2 CI
abline(v= mean(sample_2), col="orange", lwd=2, lty=1)
abline(v=fn_upper_ci, col="orange", lwd=2, lty=3)
abline(v=fn_lower_ci, col="orange", lwd=2, lty=3)
# Overlay Difference's CI
abline(v=mean(sample_1)-mean(sample_2), col="darkgrey", lwd=2, lty=1)
abline(v=dif_upper_ci, col="darkgrey", lwd=2, lty=3)
abline(v=dif_lower_ci, col="darkgrey", lwd=2, lty=3)
cal_xseq <- seq(-5,15,0.01)
cal_pdf <- dnorm(x = cal_xseq, mean = mean(sample_1)-mean(sample_2), sd = cal_se)
lines(cal_xseq, cal_pdf, col="darkgrey", type="l",lwd=2)
# Visualise Difference's CIs alone:
plot(cal_xseq, cal_pdf, col="darkgrey", xlab="Prevalence Difference",
ylab="Density", type="l",lwd=2, main="Normal PDF")
# Overlay Difference's CI
abline(v=mean(sample_1)-mean(sample_2), col="darkgrey", lwd=2, lty=1)
abline(v=dif_upper_ci, col="darkgrey", lwd=2, lty=3)
abline(v=dif_lower_ci, col="darkgrey", lwd=2, lty=3)
<h3>
Hypothesis Test - Proportion - prop.test()
</h3>
Two sample proportion test (t test):
https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval
Prepare samples
# Filter and create dataframe of different data sources, for easy data access
ASD_State_SPED <- subset(ASD_State, Source == 'sped')
# ----------------------------------
# Use a sample of one US. State's ASD prevalence (proportion) to estimate:
# Prevalence of THAT US. State's ALL Children (the *Population*) [Source SPED, Year 2016]
# ----------------------------------
# No. Children with ASD
ASD <- ASD_State_SPED$Numerator_ASD[ASD_State_SPED$Year == 2016]
#
str(ASD)
## int [1:50] 6140 1204 10746 4181 79041 5902 7391 1383 782 30920 ...
# No. Children with ASD of first US. State (AL-Alabama)
ASD[1]
## [1] 6140
# No. Children surveyed
Children <- ASD_State_SPED$Denominator[ASD_State_SPED$Year == 2016]
#
str(Children)
## int [1:50] 674701 119217 1033241 440130 5686400 808556 479961 124609 65732 2555399 ...
# No. Children surveyed of first US. State (AL-Alabama)
Children[1]
## [1] 674701
Two group hypothesis test : proportions (Prevalvence) among two US. States
# ----------------------------------
# Hypothesis Test - Proportion - prop.test()
# ----------------------------------
# Two group hypothesis test : proportions (Prevalvence) among two US. States
prop.test(ASD[1:2], Children[1:2])
##
## 2-sample test for equality of proportions with continuity correction
##
## data: ASD[1:2] out of Children[1:2]
## X-squared = 10.922, df = 1, p-value = 0.0009503
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.0016149661 -0.0003828408
## sample estimates:
## prop 1 prop 2
## 0.009100327 0.010099231
Multiple group hypothesis test : proportions (Prevalvence) among all US. Statess
# ----------------------------------
# Hypothesis Test - Proportion - prop.test()
# ----------------------------------
# Multiple group hypothesis test : proportions (Prevalvence) among all US. States
prop.test(ASD, Children)
##
## 50-sample test for equality of proportions without continuity
## correction
##
## data: ASD out of Children
## X-squared = 28593, df = 49, p-value < 2.2e-16
## alternative hypothesis: two.sided
## sample estimates:
## prop 1 prop 2 prop 3 prop 4 prop 5 prop 6
## 0.009100327 0.010099231 0.010400284 0.009499466 0.013900007 0.007299433
## prop 7 prop 8 prop 9 prop 10 prop 11 prop 12
## 0.015399168 0.011098717 0.011896793 0.012099872 0.010300278 0.008299863
## prop 13 prop 14 prop 15 prop 16 prop 17 prop 18
## 0.009498530 0.011000237 0.014200361 0.001399357 0.008200786 0.009299486
## prop 19 prop 20 prop 21 prop 22 prop 23 prop 24
## 0.006900015 0.016696937 0.011900547 0.017399589 0.011199751 0.019400428
## prop 25 prop 26 prop 27 prop 28 prop 29 prop 30
## 0.009499812 0.012399961 0.005500221 0.010800730 0.012698882 0.014097784
## prop 31 prop 32 prop 33 prop 34 prop 35 prop 36
## 0.014100222 0.008501570 0.012999924 0.011199812 0.009796534 0.012499686
## prop 37 prop 38 prop 39 prop 40 prop 41 prop 42
## 0.008899480 0.015100037 0.016899883 0.015201438 0.009599935 0.008997355
## prop 43 prop 44 prop 45 prop 46 prop 47 prop 48
## 0.009500416 0.010199940 0.008699269 0.012103956 0.014200156 0.011200227
## prop 49 prop 50
## 0.008601179 0.009299132
<h3>
Hypothesis Test - Case Study: Understanding the newborn gender differences in Singapore
</h3>
https://data.gov.sg/dataset/newborns-by-gender
Above tabular CSV data from Data.gov.sg https://data.gov.sg/dataset/e55726f8-6d91-4cad-8e40-d4aa10ec7877/download
Read in CSV data, storing as R dataframe
df_newborns <- read.csv("../reference/newborns-by-gender/newborns-by-gender-from-1984-2018.csv")
head(df_newborns)
## year gender newborns
## 1 1984 Males 21661
## 2 1984 Females 19894
## 3 1984 Unknown 1
## 4 1985 Males 22027
## 5 1985 Females 20452
## 6 1985 Unknown 5
tail(df_newborns)
## year gender newborns
## 100 2017 Males 20408
## 101 2017 Females 19207
## 102 2017 Unknown 0
## 103 2018 Males 20093
## 104 2018 Females 18945
## 105 2018 Unknown 1
# Example of filtering data
subset(df_newborns, year == 2018, select = c(year, gender, newborns))
## year gender newborns
## 103 2018 Males 20093
## 104 2018 Females 18945
## 105 2018 Unknown 1
# year 2018 Male newborn number
subset(df_newborns, year == 2018 & gender == "Males")$newborns
## [1] 20093
# year 2018 Female newborn number
subset(df_newborns, year == 2018 & gender == "Females")$newborns
## [1] 18945
One sample proportion test:
# year 2018 Male newborn proportion vs. expected 50% ratio (null probability 0.5)
mn = subset(df_newborns, year == 2018 & gender == "Males")$newborns # Male newborns
fn = subset(df_newborns, year == 2018 & gender == "Females")$newborns # Female newborns
total = mn + fn
prop.test(x = mn, n = total)
##
## 1-sample proportions test with continuity correction
##
## data: mn out of total, null probability 0.5
## X-squared = 33.701, df = 1, p-value = 6.428e-09
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.5097318 0.5196725
## sample estimates:
## p
## 0.5147036
# prop.test(x = mn, n = total, alternative = "greater")
Two sample proportion test:
subset(df_newborns[order(-df_newborns$year),], year %in% c(2018, 1984) & gender == "Males")
## year gender newborns
## 103 2018 Males 20093
## 1 1984 Males 21661
subset(df_newborns[order(-df_newborns$year),], year %in% c(2018, 1984) & gender == "Females")
## year gender newborns
## 104 2018 Females 18945
## 2 1984 Females 19894
# year 2018 Male newborn proportion vs. year 1984 Male newborn proportion
mn = subset(df_newborns[order(-df_newborns$year),], year %in% c(2018, 1984)
& gender == "Males")$newborns # Male newborns
fn = subset(df_newborns[order(-df_newborns$year),], year %in% c(2018, 1984)
& gender == "Females")$newborns # Female newborns
total = mn + fn
prop.test(x = mn, n = total)
##
## 2-sample test for equality of proportions with continuity correction
##
## data: mn out of total
## X-squared = 3.4404, df = 1, p-value = 0.06362
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.0134849792 0.0003702646
## sample estimates:
## prop 1 prop 2
## 0.5147036 0.5212610
# prop.test(x = mn, n = total, alternative = "less")
Two sample mean test (t test):
mn_all = subset(df_newborns[order(-df_newborns$year),], gender == "Males")
dim(mn_all)
## [1] 35 3
head(mn_all)
## year gender newborns
## 103 2018 Males 20093
## 100 2017 Males 20408
## 97 2016 Males 21315
## 94 2015 Males 21755
## 91 2014 Males 21679
## 88 2013 Males 20528
fn_all = subset(df_newborns[order(-df_newborns$year),], gender == "Females")
dim(fn_all)
## [1] 35 3
head(fn_all)
## year gender newborns
## 104 2018 Females 18945
## 101 2017 Females 19207
## 98 2016 Females 19936
## 95 2015 Females 20430
## 92 2014 Females 20552
## 89 2013 Females 19191
cat("\nMale average newborns:", mean(mn_all$newborns)) # Male average
##
## Male average newborns: 22319.94
cat("\nFemale average newborns:", mean(fn_all$newborns)) # Female average
##
## Female average newborns: 20836.89
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=4)
par(mfrow=c(1, 2))
# Male
plot(density(mn_all$newborns), col="blue", lwd=2)
hist(mn_all$newborns, probability = T, add = T)
# Female
plot(density(fn_all$newborns), col="orange", lwd=2)
hist(fn_all$newborns, probability = T, add = T)
par(mfrow=c(1, 1))
plot(density(fn_all$newborns), col="orange", lwd=2)
lines(density(mn_all$newborns), col="blue", lwd=2)
# t test with equal variance:
t.test(mn_all$newborns, fn_all$newborns, paired = FALSE, var.equal = TRUE)
##
## Two Sample t-test
##
## data: mn_all$newborns and fn_all$newborns
## t = 2.6935, df = 68, p-value = 0.008898
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 384.3287 2581.7855
## sample estimates:
## mean of x mean of y
## 22319.94 20836.89
# t.test(mn_all$newborns, fn_all$newborns, paired = FALSE, var.equal = TRUE, alternative = "greater")
#############################################################
# Deep Dive : Equal variance t test
#############################################################
# sample size
cal_n1 = length(mn_all$newborns)
cat("\ncal_n1:", cal_n1)
##
## cal_n1: 35
cal_n2 = length(fn_all$newborns)
cat("\ncal_n2:", cal_n2)
##
## cal_n2: 35
# degree of freedom
cal_df = cal_n1 + cal_n2 - 2
cat("\ncal_df:", cal_df)
##
## cal_df: 68
# pooled standard deviation
cal_s = sqrt(((cal_n1 - 1)*sd(mn_all$newborns)^2 + (cal_n2 - 1)*sd(fn_all$newborns)^2)/(cal_n1 + cal_n2 -2))
cat("\ncal_s :", cal_s)
##
## cal_s : 2303.374
# combined standard error
cal_se = cal_s*sqrt(1/cal_n1 + 1/cal_n2)
cat("\ncal_se:", cal_se)
##
## cal_se: 550.6116
# t statistic using combined standard error
cal_t = (mean(mn_all$newborns)-mean(fn_all$newborns))/cal_se
cat("\ncal_t :", cal_t)
##
## cal_t : 2.693472
# p-value
cal_p = (1-pt(q = cal_t, df = cal_df))*2
cat("\ncal_p :", cal_p)
##
## cal_p : 0.00889784
# 95% CI using "pooled standard deviation"
# Male Upper CI
mn_upper_ci = mean(mn_all$newborns) + qt(p = 0.975, df = cal_df)*cal_s/sqrt(cal_n1+cal_n2)
cat("\nmn_upper_ci:", mn_upper_ci)
##
## mn_upper_ci: 22869.31
# Male Lower CI
mn_lower_ci = mean(mn_all$newborns) - qt(p = 0.975, df = cal_df)*cal_s/sqrt(cal_n1+cal_n2)
cat("\nmn_lower_ci:", mn_lower_ci)
##
## mn_lower_ci: 21770.58
# Female Upper CI
fn_upper_ci = mean(fn_all$newborns) + qt(p = 0.975, df = cal_df)*cal_s/sqrt(cal_n1+cal_n2)
cat("\nfn_upper_ci:", fn_upper_ci)
##
## fn_upper_ci: 21386.25
# Female Lower CI
fn_lower_ci = mean(fn_all$newborns) - qt(p = 0.975, df = cal_df)*cal_s/sqrt(cal_n1+cal_n2)
cat("\nfn_lower_ci:", fn_lower_ci)
##
## fn_lower_ci: 20287.52
# Difference with 95% CI
cat("\nDifference's Upper CI:", mn_upper_ci - fn_lower_ci)
##
## Difference's Upper CI: 2581.786
cat("\nDifference's Lower CI:", mn_lower_ci - fn_upper_ci)
##
## Difference's Lower CI: 384.3287
# Difference with 95% CI using combined standard error
# Difference's Upper CI
dif_upper_ci = mean(mn_all$newborns)-mean(fn_all$newborns) + qt(p = 0.975, df = cal_df)*cal_se
cat("\nDifference's Upper CI:", dif_upper_ci)
##
## Difference's Upper CI: 2581.786
# Difference's Lower CI
dif_lower_ci = mean(mn_all$newborns)-mean(fn_all$newborns) - qt(p = 0.975, df = cal_df)*cal_se
cat("\nDifference's Lower CI:", dif_lower_ci)
##
## Difference's Lower CI: 384.3287
# Visualise CIs: Male & Female
plot(density(fn_all$newborns), col="orange", xlab="Prevalence", main="PDF", lwd=2)
lines(density(mn_all$newborns), col="blue", lwd=2)
# Overlay Male CI
abline(v= mean(mn_all$newborns), col="blue", lwd=2, lty=1)
abline(v=mn_upper_ci, col="blue", lwd=2, lty=3)
abline(v=mn_lower_ci, col="blue", lwd=2, lty=3)
# Overlay Female CI
abline(v= mean(fn_all$newborns), col="orange", lwd=2, lty=1)
abline(v=fn_upper_ci, col="orange", lwd=2, lty=3)
abline(v=fn_lower_ci, col="orange", lwd=2, lty=3)
# Visualise CIs: Difference between Male and Female
plot(density(mn_all$newborns-fn_all$newborns), col="cyan", xlab="Prevalence Difference",
main="PDF", lwd=2, lty=1, xlim=c(0,3000))
# Overlay Difference's CI
abline(v=mean(mn_all$newborns)-mean(fn_all$newborns), col="darkgrey", lwd=2, lty=1)
abline(v=dif_upper_ci, col="darkgrey", lwd=2, lty=3)
abline(v=dif_lower_ci, col="darkgrey", lwd=2, lty=3)
cal_xseq <- seq(0,3000,0.01)
cal_pdf <- dnorm(x = cal_xseq, mean = mean(mn_all$newborns)-mean(fn_all$newborns), sd = cal_se)
lines(cal_xseq, cal_pdf, col="darkgrey", type="l",lwd=2)
# t test with unequal variances
t.test(mn_all$newborns, fn_all$newborns, paired = FALSE)
##
## Welch Two Sample t-test
##
## data: mn_all$newborns and fn_all$newborns
## t = 2.6935, df = 67.497, p-value = 0.008912
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 384.1804 2581.9339
## sample estimates:
## mean of x mean of y
## 22319.94 20836.89
# t.test(mn_all$newborns, fn_all$newborns, paired = FALSE, alternative = "greater")
#############################################################
# Deep Dive : Unequal variance t test
#############################################################
# sample size
cal_n1 = length(mn_all$newborns)
cat("\ncal_n1:", cal_n1)
##
## cal_n1: 35
cal_n2 = length(fn_all$newborns)
cat("\ncal_n2:", cal_n2)
##
## cal_n2: 35
# degree of freedom
# cal_df = cal_n1 + cal_n2 - 2 # <-- This is df of equal variances
cal_df = ( sd(mn_all$newborns)^2/cal_n1 + sd(fn_all$newborns)^2/cal_n2 )^2 /
((sd(mn_all$newborns)^2/cal_n1)^2/(cal_n1 - 1) + (sd(fn_all$newborns)^2/cal_n2)^2/(cal_n2 - 1))
cat("\ncal_df:", cal_df)
##
## cal_df: 67.49676
# combined standard error: Welch's denominator (It is NOT a pooled standard deviation)
cal_se = sqrt( sd(mn_all$newborns)^2/cal_n1 + sd(fn_all$newborns)^2/cal_n2 )
cat("\ncal_se:", cal_se)
##
## cal_se: 550.6116
# t statistic using combined standard error: Welch's denominator
cal_t = (mean(mn_all$newborns)-mean(fn_all$newborns))/cal_se
cat("\ncal_t :", cal_t)
##
## cal_t : 2.693472
# p-value
cal_p = (1-pt(q = cal_t, df = cal_df))*2
cat("\ncal_p :", cal_p)
##
## cal_p : 0.008912138
# Difference with 95% CI using combined standard error
# Difference's Upper CI
dif_upper_ci = mean(mn_all$newborns)-mean(fn_all$newborns) + qt(p = 0.975, df = cal_df)*cal_se
cat("\nDifference's Upper CI:", dif_upper_ci)
##
## Difference's Upper CI: 2581.934
# Difference's Lower CI
dif_lower_ci = mean(mn_all$newborns)-mean(fn_all$newborns) - qt(p = 0.975, df = cal_df)*cal_se
cat("\nDifference's Lower CI:", dif_lower_ci)
##
## Difference's Lower CI: 384.1804
# Visualise CIs:
plot(density(mn_all$newborns-fn_all$newborns), col="cyan", xlab="Prevalence Difference",
main="PDF", lwd=2, lty=1, xlim=c(0,3000))
# Overlay Difference's CI
abline(v=mean(mn_all$newborns)-mean(fn_all$newborns), col="darkgrey", lwd=2, lty=1)
abline(v=dif_upper_ci, col="darkgrey", lwd=2, lty=3)
abline(v=dif_lower_ci, col="darkgrey", lwd=2, lty=3)
cal_xseq <- seq(0,3000,0.01)
cal_pdf <- dnorm(x = cal_xseq, mean = mean(mn_all$newborns)-mean(fn_all$newborns), sd = cal_se)
lines(cal_xseq, cal_pdf, col="darkgrey", type="l",lwd=2)
Two sample mean test (paired t test):
t.test(mn_all$newborns, fn_all$newborns, paired = TRUE)
##
## Paired t-test
##
## data: mn_all$newborns and fn_all$newborns
## t = 32.368, df = 34, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1389.942 1576.172
## sample estimates:
## mean of the differences
## 1483.057
# t.test(mn_all$newborns, fn_all$newborns, paired = TRUE, alternative = "greater")
[ Tips ] Paired t test essentially evaluates the difference of two samples, assuming ‘the difference’ forming a t/normal distribution.
# Difference between Male and Female newborns
# Normality test:
shapiro.test(mn_all$newborns - fn_all$newborns)
##
## Shapiro-Wilk normality test
##
## data: mn_all$newborns - fn_all$newborns
## W = 0.9522, p-value = 0.1324
# Visualise:
par(mfrow=c(1, 2))
plot(density(mn_all$newborns - fn_all$newborns), col="darkgrey", lwd=2)
hist(mn_all$newborns - fn_all$newborns, probability = T, add = T)
qqnorm(mn_all$newborns - fn_all$newborns, col="darkgrey",
xlab="z Value", ylab="Newborns difference")
qqline(mn_all$newborns - fn_all$newborns, col="red", lwd=2, lty=3)
par(mfrow=c(1, 1))
#############################################################
# Deep Dive : Paired t test
#############################################################
# sample size
cal_n = length(mn_all$newborns - fn_all$newborns)
cat("\ncal_n :", cal_n)
##
## cal_n : 35
# degree of freedom
cal_df = cal_n - 1
cat("\ncal_df:", cal_df)
##
## cal_df: 34
# standard deviation
cal_s = sd(mn_all$newborns - fn_all$newborns)
cat("\ncal_s :", cal_s)
##
## cal_s : 271.0677
# standard error
cal_se = cal_s / sqrt(cal_n)
cat("\ncal_se:", cal_se)
##
## cal_se: 45.81881
# t statistic calculated using standard error:
cal_t = mean(mn_all$newborns-fn_all$newborns)/cal_se
cat("\ncal_t :", cal_t)
##
## cal_t : 32.36787
# p-value
cal_p = (1-pt(q = cal_t, df = cal_df))*2
cat("\ncal_p :", cal_p)
##
## cal_p : 0
# 95% CI using "standard deviation"
# Paired Difference's Upper CI
dif_upper_ci = mean(mn_all$newborns-fn_all$newborns) + qt(p = 0.975, df = cal_df)*cal_se
cat("\ndif_upper_ci:", dif_upper_ci)
##
## dif_upper_ci: 1576.172
# Paired Difference's Lower CI
dif_lower_ci = mean(mn_all$newborns-fn_all$newborns) - qt(p = 0.975, df = cal_df)*cal_se
cat("\ndif_lower_ci:", dif_lower_ci)
##
## dif_lower_ci: 1389.942
# Visualise CIs:
plot(density(mn_all$newborns-fn_all$newborns), col="cyan", xlab="Prevalence",
main="PDF", lwd=2, lty=1, xlim=c(0,3000), ylim=c(0,0.01))
# Overlay CI
abline(v= mean(mn_all$newborns-fn_all$newborns), col="darkgrey", lwd=2, lty=1)
abline(v=dif_upper_ci, col="darkgrey", lwd=2, lty=3)
abline(v=dif_lower_ci, col="darkgrey", lwd=2, lty=3)
cal_xseq <- seq(0,3000,0.01)
cal_pdf <- dnorm(x = cal_xseq, mean = mean(mn_all$newborns-fn_all$newborns), sd = cal_se)
lines(cal_xseq, cal_pdf, col="darkgrey", type="l",lwd=2)
cat("\nPaired Difference's Upper CI:", dif_upper_ci)
##
## Paired Difference's Upper CI: 1576.172
cat("\nPaired Difference's Lower CI:", dif_lower_ci)
##
## Paired Difference's Lower CI: 1389.942
Two sample mean test (paired t test):
# Proportion difference between Male and Female newborns
# Uncomment below to display Male & Female newborns ratio/proportion over years:
# mn_all$newborns/(mn_all$newborns+fn_all$newborns)
# fn_all$newborns/(mn_all$newborns+fn_all$newborns)
t.test(mn_all$newborns/(mn_all$newborns+fn_all$newborns),
fn_all$newborns/(mn_all$newborns+fn_all$newborns), paired = TRUE)
##
## Paired t-test
##
## data: mn_all$newborns/(mn_all$newborns + fn_all$newborns) and fn_all$newborns/(mn_all$newborns + fn_all$newborns)
## t = 47.032, df = 34, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.03277953 0.03574025
## sample estimates:
## mean of the differences
## 0.03425989
t.test(mn_all$newborns/(mn_all$newborns+fn_all$newborns),
fn_all$newborns/(mn_all$newborns+fn_all$newborns), paired = TRUE, alternative = "greater")
##
## Paired t-test
##
## data: mn_all$newborns/(mn_all$newborns + fn_all$newborns) and fn_all$newborns/(mn_all$newborns + fn_all$newborns)
## t = 47.032, df = 34, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 0.03302816 Inf
## sample estimates:
## mean of the differences
## 0.03425989
# Normality test:
shapiro.test((mn_all$newborns - fn_all$newborns)/(mn_all$newborns+fn_all$newborns))
##
## Shapiro-Wilk normality test
##
## data: (mn_all$newborns - fn_all$newborns)/(mn_all$newborns + fn_all$newborns)
## W = 0.95524, p-value = 0.1641
# Visualise:
par(mfrow=c(1, 2))
plot(density((mn_all$newborns - fn_all$newborns)/(mn_all$newborns+fn_all$newborns)), col="darkgrey", lwd=2)
hist((mn_all$newborns - fn_all$newborns)/(mn_all$newborns+fn_all$newborns), probability = T, add = T)
qqnorm((mn_all$newborns - fn_all$newborns)/(mn_all$newborns+fn_all$newborns), col="darkgrey",
xlab="z Value", ylab="Newborns difference")
qqline((mn_all$newborns - fn_all$newborns)/(mn_all$newborns+fn_all$newborns), col="red", lwd=2, lty=3)
par(mfrow=c(1, 1))
<h3>
Quiz:
</h3>
<p>
TBD
</p>
# Write your code below and press Shift+Enter to execute
Double-click here for the solution.
<a href="">
</a>
Connect with the author:
This notebook was written by GU Zhan (Sam).
Sam is currently a lecturer in Institute of Systems Science in National University of Singapore. He devotes himself into pedagogy & andragogy, and is very passionate in inspiring next generation of artificial intelligence lovers and leaders.
Copyright © 2020 GU Zhan
This notebook and its source code are released under the terms of the MIT License.
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
<a href="">
</a>
<h3>
Interactive workshops: < Learning R inside R > using swirl() (in R/RStudio)
</h3>
https://github.com/telescopeuser/S-SB-Workshop
<a href="https://github.com/dd-consulting">
<img src="../reference/GZ_logo.png" width="60" align="right">
</a>